{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Lab 3 - Bar charts and green taxi trips\n",
"\n",
"Green taxis opporate primarily in upper Manhattan and the other four boroughs, and NYC Open Data has a dataset of all green taxi trips taken in 2018.\n",
"\n",
"### Getting the data\n",
"\n",
"The NYC Open Data datast of all 2018 green taxi trips is here: [https://data.cityofnewyork.us/Transportation/2018-Green-Taxi-Trip-Data/w7fs-fd9i](https://data.cityofnewyork.us/Transportation/2018-Green-Taxi-Trip-Data/w7fs-fd9i)\n",
"\n",
"The dataset contains almost 9 million rows, so as in Lab 2, we will filter the data to only be trips from Sept. 3, 2018 to make the dataset smaller. To do this:\n",
"- Click on the \"Filter\" button.\n",
"- On the menu that appear, click on \"Add a New Filter Condition\".\n",
"- Choose \"lpep_pickup_datetime\" but change the \"is\" to be \"is between\".\n",
"- Click in the box below and a calendar will pop up. Highlight September 3, 2018.\n",
"- Click the second box below and a calendar will pop up. Highlight September 4, 2018.\n",
"- It will take a few seconds (it's a large file) but the rows on the left will be filtered to be all trips with pickups between Sept. 3 2018 at 12am and Sept. 4 2018 at 12am, or all counts with pickups on Sept. 3.\n",
"\n",
"To download the file,\n",
"- Click on the \"Export\" button.\n",
"- Under \"Download\", choose \"CSV\".\n",
"- The download will begin automatically (files are usually stored in \"Downloads\" folder).\n",
"\n",
"Upload your CSV file to Jupyter Hub, and open it to see that it has been downloaded correctly. (You can also do this in Excel or Text Edit before uploading the file.)"
]
},
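{
"cell_type": "markdown",
"metadata": {},
"source": [
"The website filter above can also be sketched in pandas. This is only an illustration on made-up pickup times, not part of the download steps: the `trips` dataframe and its rows are hypothetical, and treating the end of the range as excluded is an assumption about how the site's \"is between\" filter behaves.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical pickup times; the real values come from the downloaded CSV.\n",
"trips = pd.DataFrame({\n",
"    'lpep_pickup_datetime': pd.to_datetime([\n",
"        '2018-09-02 23:59:00',\n",
"        '2018-09-03 00:00:00',\n",
"        '2018-09-03 17:30:00',\n",
"        '2018-09-04 00:05:00',\n",
"    ])\n",
"})\n",
"\n",
"start = pd.Timestamp('2018-09-03')\n",
"end = pd.Timestamp('2018-09-04')\n",
"\n",
"# Keep pickups from Sept. 3 at 12am up to, but not including, Sept. 4 at 12am.\n",
"pickups = trips['lpep_pickup_datetime']\n",
"one_day = trips[(pickups >= start) & (pickups < end)]\n",
"print(len(one_day))\n",
"```\n",
"\n",
"Only the two pickups that fall on Sept. 3 remain, so this prints 2."
]
},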
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Bar Charts\n",
"\n",
"Bar charts are used to graph qualitative (categorical) data. A bar chart visually shows the number of pieces of data in each category.\n",
"\n",
"As in Labs 1 and 2, we need to import the matplotlib and pandas packages and tell Jupyter to display the plots. Run the code below to do this."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\n",
"import pandas as pd\n",
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Can you write the code to store the data from the CSV file in a dataframe called `taxi`? Getting the data from a filing is called *reading the file*. The columns `lpep_pickup_datetime` and `lpep_dropoff_datetime` should be processed as dates."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" Answer:
\n",
"taxi = pd.read_csv(\"taxi_trip_filename.csv\",parse_dates=[\"lpep_pickup_datetime\",\"lpep_dropoff_datetime\"])\n",
" "
]
},
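{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see what `parse_dates` changes, here is a small sketch that reads two made-up rows from an in-memory string instead of the downloaded file. The rows, the `sample` name, and the date format are assumptions for illustration; the real export may format its dates differently.\n",
"\n",
"```python\n",
"import io\n",
"import pandas as pd\n",
"\n",
"# Two made-up rows standing in for the downloaded file.\n",
"sample_csv = io.StringIO('''lpep_pickup_datetime,lpep_dropoff_datetime,payment_type\n",
"2018-09-03 08:15:00,2018-09-03 08:32:00,1\n",
"2018-09-03 09:05:00,2018-09-03 09:20:00,2\n",
"''')\n",
"\n",
"sample = pd.read_csv(\n",
"    sample_csv,\n",
"    parse_dates=['lpep_pickup_datetime', 'lpep_dropoff_datetime'],\n",
")\n",
"\n",
"# With parse_dates, both columns hold datetimes, so they can be subtracted.\n",
"trip_length = sample['lpep_dropoff_datetime'] - sample['lpep_pickup_datetime']\n",
"print(sample.dtypes)\n",
"print(trip_length)\n",
"```\n",
"\n",
"Without `parse_dates`, the two columns would be read as plain text and the subtraction would fail."
]
},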
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Check that the dataframe was created correctly by displaying the first five rows:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" Hint:
\n",
" Use the head() function.\n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here are descriptions of the dataframe columns: [http://www.nyc.gov/html/tlc/downloads/pdf/data_dictionary_trip_records_green.pdf](http://www.nyc.gov/html/tlc/downloads/pdf/data_dictionary_trip_records_green.pdf) Which columns are qualitative variables?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will plot the `payment_type` column as a bar chart. First we need to count how many trips used each payment type. As in Lab 2, write a piece of code to display only the `payment_type` column."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" Answer:
\n",
"taxi[\"payment_type\"]\n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can count the number of times each different value (ex. 1, 2) appears in the `payment_type` column using the code `taxi[\"payment_type\"].value_counts()`. Type and run this code below."
]
},
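{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough sketch of what `value_counts()` does, the example below counts a handful of made-up payment codes. The `payments` series is hypothetical and is not taken from the taxi data.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Made-up payment codes, not taken from the taxi data.\n",
"payments = pd.Series([1, 2, 1, 1, 3, 2], name='payment_type')\n",
"\n",
"# Each distinct value becomes an index label; its count becomes the value.\n",
"counts = payments.value_counts()\n",
"print(counts)\n",
"```\n",
"\n",
"The result is sorted from the most common value to the least common one, so the first row shows the value that appears most often."
]
},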
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What was the most used payment type? Refer to the column information to see which payment type corresponds to which number: [http://www.nyc.gov/html/tlc/downloads/pdf/data_dictionary_trip_records_green.pdf](http://www.nyc.gov/html/tlc/downloads/pdf/data_dictionary_trip_records_green.pdf) \n",
"\n",
"We can visualize these counts with a bar chart. First we will store the counts of the values in a variable.\n",
"Type and run the following code: `pay_counts = taxi[\"payment_type\"].value_counts()`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What happened? \n",
"\n",
"The code didn't display anything, so it looks like nothing happened. However, we can check that the variable was created by typing its name and running the code:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we will plot these counts. Type and run the code `pay_counts.bar(kind = \"bar\")` below."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Why can you barely see the bars for payment types 3, 4, and 5?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Try typing and running the following code below `pay_counts.plot(kind = \"barh\")`. It is the same code but with `barh` instead of `bar`. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What does the parameter `kind = \"barh\"` do?\n",
"\n",
"The `RatecodeID` column holds information about how the trip fare is calculated. For example, 1 means it is calculated using the standard rate, while 2 means it is the flat rate for JFK to/from Manhattan. See [http://www.nyc.gov/html/tlc/html/passenger/taxicab_rate.shtml](http://www.nyc.gov/html/tlc/html/passenger/taxicab_rate.shtml) for a description of the different rate codes.\n",
"\n",
"Can you write code to make a bar chart the number of trips using each `RatecodeID`?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" Answer:
\n",
" code_counts = taxi['RatecodeID'].value_counts()\n",
" code_counts.plot.bar()\n",
" \n",
"\n",
"What rate code is used the most? Does this make sense? What rate code is used the second most?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Challenges\n",
"- Make a bar chart of one of the other categorical columns in `taxi` dataframe.\n",
"- We can make different types of plots by changing the parameter for the kind of plot. A list of all of the different kinds of plots is here: [https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html#other-plots](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html#other-plots). Can you make a pie plot for the payment type?\n",
"- Find another dataset on NYC Open Data with qualitative (categorical) data and make a bar chart of that data."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}